{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ba8527d4-7f70-4ed2-986e-9f2a48967690",
   "metadata": {},
   "source": [
    "# Custom Environment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bc6c081-51dd-4df3-8a8c-f54ffb1feeaa",
   "metadata": {},
   "source": [
    "You can find a repo with a corresponding example/template at [https://github.com/rl-tools/example](https://github.com/rl-tools/example)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae6ddca1-73c8-4280-8760-a89ddc9bf24c",
   "metadata": {},
   "source": [
    "As always we first include the elementary operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cc4f594c-fd4f-4e92-95ed-066d96010587",
   "metadata": {},
   "outputs": [],
   "source": [
    "#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS\n",
    "#include <rl_tools/operations/cpu_mux.h>\n",
    "#include <rl_tools/nn/operations_cpu_mux.h>\n",
    "#include <rl_tools/nn_models/operations_cpu.h>\n",
    "namespace rlt = rl_tools;\n",
    "#pragma cling load(\"openblas\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec2f3181-7f15-4040-bf8b-63f786b3bcd5",
   "metadata": {},
   "source": [
    "Next we define the datastructures for our new environment. The main data structures are for the state of the environmnet (`MyPendulumState`) and for the environment itself (`MyPendulum`). As usual in RLtools, we assemble all template parameters of the environment into a specification (`MyPendulumSpecification`) so that we do not need to repeat them in every function template. Furthermore we separate out the parameters. With RLtools environments we distinguish between three levels of \"state\":\n",
    "- **Environment**: Compile-time, should not change at runtime (though this is is not enforced to allow for hackability)\n",
    "- **Parameters**: Constant throughout an episode. This allows for e.g. domain randomization. It can also carry cues for the visualization (the constness during an episode is also not enforce but considered a best practice)\n",
    "- **State**: Sampled from the initial distribution at the beginning of an episode and can then change on every step\n",
    "\n",
    "To work with the RLtools API, the environment data structure (`MyPendulum`) needs to have the fields:\n",
    "- `T`: Floating point type\n",
    "- `TI`: Index/unsigned integer type\n",
    "- `Parameters`: Parameters datastructure (can also purly contain compile-time `constexpr`), should be a [Plain Old Data (POD)](https://en.wikipedia.org/wiki/Passive_data_structure) structure so that it works well on GPUs and microcontrollers\n",
    "- `State`: State datastructure, should be a [Plain Old Data (POD)](https://en.wikipedia.org/wiki/Passive_data_structure) for the same reasons\n",
    "- `OBSERVATION_DIM`: Dimension of the observations\n",
    "- `ACTION_DIM`: Dimension of the actions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7949edef-7557-4e14-9e76-1b5958b5ba77",
   "metadata": {},
   "outputs": [],
   "source": [
    "template <typename T>\n",
    "struct MyPendulumParameters {\n",
    "    constexpr static T G = 10;\n",
    "    constexpr static T MAX_SPEED = 8;\n",
    "    constexpr static T MAX_TORQUE = 2;\n",
    "    constexpr static T DT = 0.05;\n",
    "    constexpr static T M = 1;\n",
    "    constexpr static T L = 1;\n",
    "    constexpr static T INITIAL_STATE_MIN_ANGLE = -rlt::math::PI<T>;\n",
    "    constexpr static T INITIAL_STATE_MAX_ANGLE = rlt::math::PI<T>;\n",
    "    constexpr static T INITIAL_STATE_MIN_SPEED = -1;\n",
    "    constexpr static T INITIAL_STATE_MAX_SPEED = 1;\n",
    "};\n",
    "\n",
    "template <typename T_T, typename T_TI, typename T_PARAMETERS = MyPendulumParameters<T_T>>\n",
    "struct MyPendulumSpecification{\n",
    "    using T = T_T;\n",
    "    using TI = T_TI;\n",
    "    using PARAMETERS = T_PARAMETERS;\n",
    "};\n",
    "\n",
    "template <typename T, typename TI>\n",
    "struct MyPendulumState{\n",
    "    static constexpr TI DIM = 2;\n",
    "    T theta;\n",
    "    T theta_dot;\n",
    "};\n",
    "template <typename TI>\n",
    "struct MyPendulumFourierObservation{\n",
    "    static constexpr TI DIM = 3; // cos(theta), sin(theta), theta_dot\n",
    "};\n",
    "\n",
    "template <typename T_SPEC>\n",
    "struct MyPendulum{\n",
    "    using SPEC = T_SPEC;\n",
    "    using T = typename SPEC::T;\n",
    "    using TI = typename SPEC::TI;\n",
    "    using Parameters = typename SPEC::PARAMETERS;\n",
    "    using State = MyPendulumState<T, TI>;\n",
    "    using Observation = MyPendulumFourierObservation<TI>;\n",
    "    using ObservationPrivileged = Observation;\n",
    "    static constexpr TI ACTION_DIM = 1;\n",
    "    static constexpr TI N_AGENTS = 1;\n",
    "};"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d3a16ec-e307-4918-916c-3fa21095b72f",
   "metadata": {},
   "source": [
    "Next we can start defining operations on these datastructures. Note that they should be in the `rl_tools` namespace so that the RLtools algorithms (such as the on-/off-policy runner) can find and dispatch to them. If you want to use functions outside the `rl_tools` namespace you can just implement proxy functions that call your arbitrary functions. In our case we do not need dynamic memory allocation or initialization, hence just implement them as a [NOP](https://en.wikipedia.org/wiki/NOP_(code)). The `sample_initial_state` function samples random initial states and the `initial_state` provides a deterministic initial state (for deterministic evaluations). In the case of the pendulum a reasonable choice for the latter could be e.g. the state where it is hanging downwards with zero velocity. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b3bdeeea-3caa-4b79-85c5-76683abc9af6",
   "metadata": {},
   "outputs": [],
   "source": [
    "namespace rl_tools{\n",
    "    template<typename DEVICE, typename SPEC>\n",
    "    static void malloc(DEVICE& device, const MyPendulum<SPEC>& env){}\n",
    "    template<typename DEVICE, typename SPEC>\n",
    "    static void free(DEVICE& device, const MyPendulum<SPEC>& env){}\n",
    "    template<typename DEVICE, typename SPEC>\n",
    "    static void init(DEVICE& device, const MyPendulum<SPEC>& env){}\n",
    "    template<typename DEVICE, typename SPEC, typename RNG>\n",
    "    static void sample_initial_parameters(DEVICE& device, const MyPendulum<SPEC>& env, typename MyPendulum<SPEC>::Parameters& parameters, RNG& rng){ }\n",
    "    template<typename DEVICE, typename SPEC>\n",
    "    static void initial_parameters(DEVICE& device, const MyPendulum<SPEC>& env, typename MyPendulum<SPEC>::Parameters& parameters){ }\n",
    "    template<typename DEVICE, typename SPEC, typename RNG>\n",
    "    static void sample_initial_state(DEVICE& device, const MyPendulum<SPEC>& env, const typename MyPendulum<SPEC>::Parameters& parameters, typename MyPendulum<SPEC>::State& state, RNG& rng){\n",
    "        state.theta     = random::uniform_real_distribution(typename DEVICE::SPEC::RANDOM(), SPEC::PARAMETERS::INITIAL_STATE_MIN_ANGLE, SPEC::PARAMETERS::INITIAL_STATE_MAX_ANGLE, rng);\n",
    "        state.theta_dot = random::uniform_real_distribution(typename DEVICE::SPEC::RANDOM(), SPEC::PARAMETERS::INITIAL_STATE_MIN_SPEED, SPEC::PARAMETERS::INITIAL_STATE_MAX_SPEED, rng);\n",
    "    }\n",
    "    template<typename DEVICE, typename SPEC>\n",
    "    static void initial_state(DEVICE& device, const MyPendulum<SPEC>& env, const typename MyPendulum<SPEC>::Parameters& parameters, typename MyPendulum<SPEC>::State& state){\n",
    "        state.theta = -rlt::math::PI<typename SPEC::T>;\n",
    "        state.theta_dot = 0;\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71103725-03e5-44dd-98f5-4779acaa44ec",
   "metadata": {},
   "source": [
    "In the following we define some helper functions. Note: the usage of `rlt::math::xxx` for math functions seems tedious over e.g. `std::xxx` but it allows to dispatch to the right math implementations on GPUs and microcontrollers and hence running the same code on any device. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fda09574-1230-4a9e-b76e-5a21146beb00",
   "metadata": {},
   "outputs": [],
   "source": [
    "template <typename T>\n",
    "T clip(T x, T min, T max){\n",
    "    x = x < min ? min : (x > max ? max : x);\n",
    "    return x;\n",
    "}\n",
    "template <typename DEVICE, typename T>\n",
    "T f_mod_python(const DEVICE& dev, T a, T b){\n",
    "    return a - b * rlt::math::floor(dev, a / b);\n",
    "}\n",
    "\n",
    "template <typename DEVICE, typename T>\n",
    "T angle_normalize(const DEVICE& dev, T x){\n",
    "    return f_mod_python(dev, (x + rlt::math::PI<T>), (2 * rlt::math::PI<T>)) - rlt::math::PI<T>;\n",
    "}"
   ]
  },
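  {
   "cell_type": "markdown",
   "id": "angle-normalize-worked-example",
   "metadata": {},
   "source": [
    "As a worked restatement of the helpers above (a sketch that follows directly from the code; the names $\\operatorname{mod}$ and $\\operatorname{normalize}$ are only shorthand for `f_mod_python` and `angle_normalize`): `f_mod_python` implements the floored modulo, and `angle_normalize` uses it to wrap an angle into $[-\\pi, \\pi)$:\n",
    "\n",
    "$$\\operatorname{mod}(a, b) = a - b \\left\\lfloor \\frac{a}{b} \\right\\rfloor, \\qquad \\operatorname{normalize}(x) = \\operatorname{mod}(x + \\pi,\\, 2\\pi) - \\pi$$\n",
    "\n",
    "For example, $x = \\tfrac{3\\pi}{2}$ gives $\\operatorname{mod}(\\tfrac{5\\pi}{2}, 2\\pi) - \\pi = \\tfrac{\\pi}{2} - \\pi = -\\tfrac{\\pi}{2}$, and $x = -\\pi$ is mapped to itself because $\\operatorname{mod}(0, 2\\pi) = 0$."
   ]
  },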
  {
   "cell_type": "markdown",
   "id": "255fce1d-7a17-44e0-9427-0ba3f7ce9db5",
   "metadata": {},
   "source": [
    "Next we implement the most important operations (which resemble the OpenAI gym interface): \n",
    "- `step`: Takes a `state`, executes an `action` and sets the `next_state`\n",
    "- `reward`: Returns the reward based on the `state`, `action`, and `next_state`\n",
    "- `observe`: Observes the `state`. For fully observed environments this should basically just flatten the `::State` data structure and possibly apply some observation noise. For partially observable environments the observation can also just contain parts of the information in the `::State`\n",
    "- `terminated`: Returns a boolean flag signalling if the `state` is a terminal state"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "689e66c0-bfad-4042-b4a3-7eddd3addf07",
   "metadata": {},
   "outputs": [],
   "source": [
    "namespace rl_tools{\n",
    "    template<typename DEVICE, typename SPEC, typename ACTION_SPEC, typename RNG>\n",
    "    typename SPEC::T step(DEVICE& device, const MyPendulum<SPEC>& env, const typename MyPendulum<SPEC>::Parameters& parameters, const typename MyPendulum<SPEC>::State& state, const Matrix<ACTION_SPEC>& action, typename MyPendulum<SPEC>::State& next_state, RNG& rng) {\n",
    "        static_assert(ACTION_SPEC::ROWS == 1);\n",
    "        static_assert(ACTION_SPEC::COLS == 1);\n",
    "        typedef typename SPEC::T T;\n",
    "        typedef typename SPEC::PARAMETERS PARAMS;\n",
    "        T u_normalised = get(action, 0, 0);\n",
    "        T u = PARAMS::MAX_TORQUE * u_normalised;\n",
    "        T g = PARAMS::G;\n",
    "        T m = PARAMS::M;\n",
    "        T l = PARAMS::L;\n",
    "        T dt = PARAMS::DT;\n",
    "\n",
    "        u = clip(u, -PARAMS::MAX_TORQUE, PARAMS::MAX_TORQUE);\n",
    "\n",
    "        T newthdot = state.theta_dot + (3 * g / (2 * l) * rlt::math::sin(device.math, state.theta) + 3.0 / (m * l * l) * u) * dt;\n",
    "        newthdot = clip(newthdot, -PARAMS::MAX_SPEED, PARAMS::MAX_SPEED);\n",
    "        T newth = state.theta + newthdot * dt;\n",
    "\n",
    "        next_state.theta = newth;\n",
    "        next_state.theta_dot = newthdot;\n",
    "        return SPEC::PARAMETERS::DT;\n",
    "    }\n",
    "    template<typename DEVICE, typename SPEC, typename ACTION_SPEC, typename RNG>\n",
    "    static typename SPEC::T reward(DEVICE& device, const MyPendulum<SPEC>& env, const typename MyPendulum<SPEC>::Parameters& parameters, const typename MyPendulum<SPEC>::State& state, const Matrix<ACTION_SPEC>& action, const typename MyPendulum<SPEC>::State& next_state, RNG& rng){\n",
    "        typedef typename SPEC::T T;\n",
    "        T angle_norm = angle_normalize(device.math, state.theta);\n",
    "        T u_normalised = get(action, 0, 0);\n",
    "        T u = SPEC::PARAMETERS::MAX_TORQUE * u_normalised;\n",
    "        T costs = angle_norm * angle_norm + 0.1 * state.theta_dot * state.theta_dot + 0.001 * (u * u);\n",
    "        return -costs;\n",
    "    }\n",
    "\n",
    "    template<typename DEVICE, typename SPEC, typename OBS_TYPE_SPEC, typename OBS_SPEC, typename RNG>\n",
    "    static void observe(DEVICE& device, const MyPendulum<SPEC>& env, const typename MyPendulum<SPEC>::Parameters& parameters, const typename MyPendulum<SPEC>::State& state, const MyPendulumFourierObservation<OBS_TYPE_SPEC>&, Matrix<OBS_SPEC>& observation, RNG& rng){\n",
    "        static_assert(OBS_SPEC::ROWS == 1);\n",
    "        static_assert(OBS_SPEC::COLS == 3);\n",
    "        typedef typename SPEC::T T;\n",
    "        set(observation, 0, 0, rlt::math::cos(device.math, state.theta));\n",
    "        set(observation, 0, 1, rlt::math::sin(device.math, state.theta));\n",
    "        set(observation, 0, 2, state.theta_dot);\n",
    "    }\n",
    "    template<typename DEVICE, typename SPEC, typename RNG>\n",
    "    RL_TOOLS_FUNCTION_PLACEMENT static bool terminated(DEVICE& device, const MyPendulum<SPEC>& env, const typename MyPendulum<SPEC>::Parameters& parameters, const typename MyPendulum<SPEC>::State state, RNG& rng){\n",
    "        using T = typename SPEC::T;\n",
    "        return false;\n",
    "    }\n",
    "}"
   ]
  },
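  {
   "cell_type": "markdown",
   "id": "step-reward-worked-equations",
   "metadata": {},
   "source": [
    "Written out as equations, the `step` and `reward` functions above compute the following. This is only a restatement of the code in mathematical notation, and the symbols are introduced here for illustration: $a_t$ is the normalized action, $\\tilde u_t = u_{\\max} a_t$ the unclipped torque with $u_{\\max}$ = `MAX_TORQUE`, $\\dot\\theta_{\\max}$ = `MAX_SPEED`, $\\Delta t$ = `DT`, and $\\operatorname{normalize}$ stands for `angle_normalize`.\n",
    "\n",
    "$$u_t = \\operatorname{clip}\\left(\\tilde u_t,\\; -u_{\\max},\\; u_{\\max}\\right)$$\n",
    "\n",
    "$$\\dot\\theta_{t+1} = \\operatorname{clip}\\left(\\dot\\theta_t + \\left(\\frac{3g}{2l}\\sin\\theta_t + \\frac{3}{m l^2}\\, u_t\\right)\\Delta t,\\; -\\dot\\theta_{\\max},\\; \\dot\\theta_{\\max}\\right)$$\n",
    "\n",
    "$$\\theta_{t+1} = \\theta_t + \\dot\\theta_{t+1}\\,\\Delta t$$\n",
    "\n",
    "$$r_t = -\\left(\\operatorname{normalize}(\\theta_t)^2 + 0.1\\,\\dot\\theta_t^2 + 0.001\\,\\tilde u_t^2\\right)$$\n",
    "\n",
    "Note that the reward uses the pre-step state $(\\theta_t, \\dot\\theta_t)$ and the unclipped torque $\\tilde u_t$, whereas the dynamics use the clipped torque $u_t$. With the default parameters ($g = 10$, $m = l = 1$, $\\Delta t = 0.05$), the state $\\theta_t = \\tfrac{\\pi}{2}$, $\\dot\\theta_t = 0$ and the action $a_t = 0$ give $\\dot\\theta_{t+1} = 15 \\cdot 0.05 = 0.75$, $\\theta_{t+1} = \\tfrac{\\pi}{2} + 0.0375$ and $r_t = -\\tfrac{\\pi^2}{4} \\approx -2.47$."
   ]
  },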
  {
   "cell_type": "markdown",
   "id": "d47b8b88-978a-490c-a1f2-b2b0f07e4fae",
   "metadata": {},
   "source": [
    "Since the training functions for the RL algorithms need to execute these operations they need to be defined before the RL data-collection and training operations. Hence in the following we include the RL ([Loop Interface](./07-The%20Loop%20Interface.ipynb)) operations. Note: when setting up your project you might want to assemble the previous data-structure definitions and operations into a header file so that all the `#include` directives are at the beginning of your code (still remember to include the header files for your environment before the RL operations). A recommended structure (that RLtools follows internally as well) is to put the the environment (for this example) into a `my_pendulum` directory. Then the datastructures are in `my_pendulum/my_pendulum.h` and the operations are in `my_pendulum/operations_generic.h`. `operations_generic.h` means that these are pure C++, dependency-free operations that can run on any device. If you need to use external libraries (e.g. `std::xxx` or `nlohmann::json`) you should separate out these operations into a device specific header, e.g. `my_pendulum/operations_cpu.h`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "850a41e6-a1ad-4d37-ac06-298d6478bbf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "#include <rl_tools/rl/algorithms/ppo/loop/core/config.h>\n",
    "#include <rl_tools/rl/algorithms/ppo/loop/core/operations_generic.h>\n",
    "#include <rl_tools/rl/loop/steps/evaluation/config.h>\n",
    "#include <rl_tools/rl/loop/steps/evaluation/operations_generic.h>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95dcf94b-4ac6-4e7e-829d-9a00a9325ff3",
   "metadata": {},
   "source": [
    "Finally, we can use our new environment and train it using the [Loop Interface](./07-The%20Loop%20Interface.ipynb) (same as in the previous chapter):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5ef35895-a619-4fb7-937d-5ce950c42c33",
   "metadata": {},
   "outputs": [],
   "source": [
    "using DEVICE = rlt::devices::DEVICE_FACTORY<>;\n",
    "using RNG = decltype(rlt::random::default_engine(typename DEVICE::SPEC::RANDOM{}));\n",
    "using T = float;\n",
    "using TI = typename DEVICE::index_t;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "75070c5b-08de-4a14-8c22-5cf5226dbba4",
   "metadata": {},
   "outputs": [],
   "source": [
    "using PENDULUM_SPEC = MyPendulumSpecification<T, TI, MyPendulumParameters<T>>;\n",
    "using ENVIRONMENT = MyPendulum<PENDULUM_SPEC>;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4d7817f0-2cd2-4b21-ae37-fbc977c372c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "struct LOOP_CORE_PARAMETERS: rlt::rl::algorithms::ppo::loop::core::DefaultParameters<T, TI, ENVIRONMENT>{\n",
    "    static constexpr TI EPISODE_STEP_LIMIT = 200;\n",
    "    static constexpr TI TOTAL_STEP_LIMIT = 300000;\n",
    "    static constexpr TI STEP_LIMIT = TOTAL_STEP_LIMIT/(ON_POLICY_RUNNER_STEPS_PER_ENV * N_ENVIRONMENTS) + 1; // number of PPO steps\n",
    "};\n",
    "using LOOP_CORE_CONFIG = rlt::rl::algorithms::ppo::loop::core::Config<T, TI, RNG, ENVIRONMENT, LOOP_CORE_PARAMETERS>;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8443382f-f275-4262-9a70-bf9eb23b6670",
   "metadata": {},
   "outputs": [],
   "source": [
    "template <typename NEXT>\n",
    "struct LOOP_EVAL_PARAMETERS: rlt::rl::loop::steps::evaluation::Parameters<T, TI, NEXT>{\n",
    "    static constexpr TI EVALUATION_INTERVAL = 4;\n",
    "    static constexpr TI NUM_EVALUATION_EPISODES = 10;\n",
    "    static constexpr TI N_EVALUATIONS = NEXT::CORE_PARAMETERS::STEP_LIMIT / EVALUATION_INTERVAL;\n",
    "};\n",
    "using LOOP_CONFIG = rlt::rl::loop::steps::evaluation::Config<LOOP_CORE_CONFIG, LOOP_EVAL_PARAMETERS<LOOP_CORE_CONFIG>>;\n",
    "using LOOP_STATE = typename LOOP_CONFIG::template State<LOOP_CONFIG>;"
   ]
  },
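  {
   "cell_type": "markdown",
   "id": "step-limit-worked-arithmetic",
   "metadata": {},
   "source": [
    "The two derived constants above, `STEP_LIMIT` and `N_EVALUATIONS`, rely on integer division. As a worked example, assume (hypothetically, since the defaults inherited from `DefaultParameters` are not shown in this notebook) that `ON_POLICY_RUNNER_STEPS_PER_ENV * N_ENVIRONMENTS` equals 4096:\n",
    "\n",
    "$$\\text{STEP\\_LIMIT} = \\left\\lfloor \\frac{300000}{4096} \\right\\rfloor + 1 = 73 + 1 = 74, \\qquad \\text{N\\_EVALUATIONS} = \\left\\lfloor \\frac{74}{4} \\right\\rfloor = 18$$\n",
    "\n",
    "Under that assumption the numbers would be consistent with the `x/74` step counter and the 18 evaluation lines (steps 0, 4, ..., 68) printed by the training loop at the end of this notebook."
   ]
  },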
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "afcc818f-34bb-44cc-ae45-1887b5225237",
   "metadata": {},
   "outputs": [],
   "source": [
    "DEVICE device;\n",
    "TI seed = 1;\n",
    "LOOP_STATE ls;\n",
    "rlt::malloc(device, ls);\n",
    "rlt::init(device, ls, seed);\n",
    "ls.actor_optimizer.parameters.alpha = 1e-3; // increasing the learning rate leads to faster training of the Pendulum-v1 environment\n",
    "ls.critic_optimizer.parameters.alpha = 1e-3;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7719b694-13ce-4af2-87d6-becae738b0ce",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Step: 0/74 Mean return: -1300.94 Mean episode length: 200\n",
      "Step: 4/74 Mean return: -1230.49 Mean episode length: 200\n",
      "Step: 8/74 Mean return: -1117.24 Mean episode length: 200\n",
      "Step: 12/74 Mean return: -702.198 Mean episode length: 200\n",
      "Step: 16/74 Mean return: -545.253 Mean episode length: 200\n",
      "Step: 20/74 Mean return: -417.902 Mean episode length: 200\n",
      "Step: 24/74 Mean return: -212.429 Mean episode length: 200\n",
      "Step: 28/74 Mean return: -160.032 Mean episode length: 200\n",
      "Step: 32/74 Mean return: -156.78 Mean episode length: 200\n",
      "Step: 36/74 Mean return: -171.889 Mean episode length: 200\n",
      "Step: 40/74 Mean return: -146.42 Mean episode length: 200\n",
      "Step: 44/74 Mean return: -111.336 Mean episode length: 200\n",
      "Step: 48/74 Mean return: -196.376 Mean episode length: 200\n",
      "Step: 52/74 Mean return: -182.804 Mean episode length: 200\n",
      "Step: 56/74 Mean return: -160.231 Mean episode length: 200\n",
      "Step: 60/74 Mean return: -183.741 Mean episode length: 200\n",
      "Step: 64/74 Mean return: -413.776 Mean episode length: 200\n",
      "Step: 68/74 Mean return: -318.449 Mean episode length: 200\n"
     ]
    }
   ],
   "source": [
    "while(!rlt::step(device, ls)){\n",
    "}"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "C++17",
   "language": "C++17",
   "name": "xcpp17"
  },
  "language_info": {
   "codemirror_mode": "text/x-c++src",
   "file_extension": ".cpp",
   "mimetype": "text/x-c++src",
   "name": "c++",
   "version": "17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}